ABSTRACT

Two observers who were using an electronic digital 
data acquisition system were spot checked for reliability at random 
times over a four month period. Between- and within-observer
reliability was assessed for frequency, duration, and
duration-per-event measures of four infant behaviors. The results
confirmed the problem of observer drift (the fluctuation of scores
across sessions) for the frequency and duration-per-event measures.
In contrast, the "real time" duration scores were stable across 
sessions, indicating the robustness of this measure of behavior. 
(Author) 



MEASURES OF RELIABILITY IN BEHAVIORAL OBSERVATION:
THE ADVANTAGE OF "REAL TIME" DATA ACQUISITION 
Albert R. Hollenbeck and Ronald G. Slaby 

University of Washington

A number of recent studies have addressed the problem of assessing
reliability in observational research (e.g., Reid, 1970; Johnson and
Bolstad, 1973; Whelan, 1974). This interest in the more subtle aspects of
the observational process reflects a general increase in the use of direct
observational measures in psychological research. Reliability in
observational research generally means that two or more observers
independently record the same naturally occurring behavioral events in a
similar way. It has been generally assumed that if two observers achieve a
high level of inter-observer reliability, then they are measuring the same
aspects of behavior across sessions as well. Yet Reid (1970) has
demonstrated that this may not be the case. His data indicate that whereas
reliability between observers can remain at a constant level, reliability
across sessions for a given observer tends to decrease with time. This
problem of observer drift, i.e., the fluctuation of observations across
sessions, has important implications both for the interpretation of data
already collected and for the collection of data in future research.

A second problem in establishing reliability is that of selecting an
appropriate statistical index. The typical measure of reliability has
been some summary statistic such as a percentage agreement score or a
correlation coefficient. It has been generally assumed that these
traditional statistical measures used to compute reliability are valid.
However, percentage agreement and correlation measures of reliability have
recently come under justifiable criticism (e.g., Hartmann, 1974) for
overestimating reliability as well as for being insensitive to chance
agreements.

In order to overcome these difficulties in establishing reliability,
systematic attempts have been made to evaluate different components of the
observational process (e.g., Mash, 1973; Taplin and Reid, 1973; Havm,
Brown, and LeBlanc, 1973). These evaluations have raised several
questions about the basic assumptions of behavioral observation methods.
For example, in the majority of observational studies, the most common
measure used has been the frequency of occurrence of some behavioral
event. In fact, most studies have been time-sampled in such a way that
actual frequencies are not scored. Rather, a modified-frequency score
(i.e., a score based on the number of arbitrarily defined time intervals
in which an event has occurred) is used to mark simple occurrence or
nonoccurrence of a behavioral event. The very nature of modified-frequency
measurement is suspect, since actual frequencies and durations are
confounded.
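
The confounding described above can be illustrated with a minimal sketch.
It assumes a 10-second scoring interval and two made-up event records;
neither the interval length nor the records come from the studies cited,
and the function name modified_frequency is purely illustrative.

```python
def modified_frequency(events, interval, trial_length):
    """Count the scoring intervals in which the behavior occurred at least once.

    events: list of (onset, offset) times in seconds.
    """
    n_intervals = int(trial_length // interval)
    scored = set()
    for onset, offset in events:
        first = int(onset // interval)
        # An event ending exactly on a boundary does not count in the next interval.
        last = int((offset - 1e-9) // interval)
        for i in range(first, min(last, n_intervals - 1) + 1):
            scored.add(i)
    return len(scored)


# Hypothetical record 1: one long event (real frequency 1, real duration 30 s).
one_long = [(0.0, 30.0)]
# Hypothetical record 2: three brief events (real frequency 3, real duration 6 s).
three_short = [(2.0, 4.0), (12.0, 14.0), (22.0, 24.0)]

for label, events in (("one long", one_long), ("three short", three_short)):
    frequency = len(events)
    duration = sum(offset - onset for onset, offset in events)
    score = modified_frequency(events, interval=10.0, trial_length=60.0)
    print(label, "frequency:", frequency, "duration:", duration,
          "modified-frequency:", score)
```

Both records receive the same modified-frequency score even though their
real frequencies and real durations differ, which is the sense in which the
two quantities are confounded.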

Recent advances in technology have provided electronic systems
which allow the unconfounded recording of actual frequency and duration
scores. MISCARS and the Behavioral Observation Scoring System (BOSS) are
two such systems (Sackett, Stephenson, and Ruppenthal, 1973). These
advances, which allow the experimenter to measure exact frequencies and
durations separately, raise several interesting questions. How does the
reliability of real frequency and real duration measures compare with
that of the modified-frequency measures typically used in previous
research? Are the unconfounded frequency and duration measures subject
to fluctuations in observations across sessions, as reported by Reid and
others for modified-frequency scores? The purpose of this study was to
examine the reliability of observations based on real frequency and real
duration measures. Specifically, the study was designed to assess observer
drift in the reliability of these measures.

Method 

Subjects 

Two female undergraduates at the University of Washington served as 
observers in this study. Both observers were volunteers who received 
academic credit for applied field work in psychology.

Apparatus 

A videotape machine was used to record the responses of a six-month- 
old infant to a stimulus presentation designed to elicit vocalizations 
from the infant. This videotaped sequence was presented to the observers
for purposes of assessing observer reliability. 

The behavior code was a modified version of the one previously used
by Hollenbeck (1971). Five mutually exclusive and exhaustive behavioral
categories (Vocalization, Head Movement, Arm Movement, Body Movement, and
No Behavior) were hierarchically arranged and scored on a priority basis.
Specifically, Vocalizations made by the infant took scoring preference
over Head Movements when both behaviors occurred simultaneously. In the
same way, Head Movements were scored over Arm Movements; Arm Movements
over Body Movements; and any movement or vocalization took scoring
preference over the No Behavior category.
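
The priority rule can be sketched in a few lines. This is only an
illustration of the hierarchy as described in the text, not the BOSS
implementation; the function name code_moment and the idea of passing the
co-occurring behaviors as a set are assumptions made for the example.

```python
# Priority order as described in the text: earlier categories take precedence.
PRIORITY = ["Vocalization", "Head Movement", "Arm Movement", "Body Movement"]


def code_moment(observed):
    """Return the single category scored for a set of co-occurring behaviors."""
    for category in PRIORITY:
        if category in observed:
            return category
    # Nothing codable is happening, so the residual category applies.
    return "No Behavior"


print(code_moment({"Head Movement", "Vocalization"}))
print(code_moment({"Body Movement", "Arm Movement"}))
print(code_moment(set()))
```

Because exactly one category is returned for every moment, the five
categories remain mutually exclusive and exhaustive.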

The Behavioral Observation Scoring System (BOSS) was used to record
the coded data. BOSS is an electronic digital data acquisition system
developed at the University of Washington Child Development and Mental
Retardation Center and the University of Washington Primate Center. This
system allows behavioral events to be recorded in terms of their actual
frequencies and durations and stored electronically on a magnetic
cassette audio tape recorder. The cassette data tapes can then be played
through an appropriate interface into a computer for analysis of the data.
A detailed description of BOSS is presented in Sackett et al. (1973).

Procedure

The observers were recruited from an undergraduate psychology 
course by means of an announcement asking for volunteers to participate 
in an observational study of infants. Academic credit in independent 
field work was offered at a later time. 

Training. The observers were trained in four phases. First,
Observer A coded the videotape sequence, stating each code aloud as it
occurred. Observer B then attempted to follow Observer A's coding, but
used her own choice of codes where disagreements occurred. Second, the
two observers discussed their disagreements with the experimenter after
each coding session. All disagreements were resolved by mutual agreement.
Third, the procedure was reversed and Observer B stated the code
while Observer A recorded silently. Finally, a third pass through the
videotape was made with each observer recording silently and independently.
This entire training procedure was repeated twice a week for one month.
At the end of the training period observers were presented a new segment
of the videotape and asked to code the tape independently. On two
successive codings of new material the observers achieved frequency
percentage agreement scores greater than 90 per cent for each trial. The
first two checks after criterion agreement was reached consisted of
part-new and part-old segments of the videotape. The mean percentage
agreement between the two observers for the frequency scores of the five
behavioral categories was 97.8 per cent and 94 per cent, respectively.
These percentages were significantly greater than a pre-established
criterion of 80 per cent agreement. Duration measures of reliability were
not computed.
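
The text does not state how the percentage agreement scores were computed.
One common convention for frequency counts, assumed here, divides the
smaller of the two observers' counts by the larger. The sketch below uses
that convention with made-up counts; only the 80 per cent criterion is
taken from the text.

```python
CRITERION = 80.0  # pre-established criterion reported in the text


def percentage_agreement(count_a, count_b):
    """Smaller count divided by larger count, times 100 (assumed convention)."""
    if count_a == count_b:
        return 100.0  # also covers the case where both counts are zero
    return 100.0 * min(count_a, count_b) / max(count_a, count_b)


# Hypothetical frequency counts for the five categories (not data from the study).
observer_a = {"Vocalization": 18, "Head Movement": 45, "Arm Movement": 50,
              "Body Movement": 24, "No Behavior": 40}
observer_b = {"Vocalization": 20, "Head Movement": 50, "Arm Movement": 50,
              "Body Movement": 25, "No Behavior": 40}

scores = {category: percentage_agreement(observer_a[category], observer_b[category])
          for category in observer_a}
mean_agreement = sum(scores.values()) / len(scores)

print(scores)
print("mean agreement:", round(mean_agreement, 1),
      "meets criterion:", mean_agreement >= CRITERION)
```

A score of this kind compares totals only, so two observers can agree
closely on how often a behavior occurred while disagreeing about when it
occurred.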

Data collection. Each observer was instructed that the primary
purpose of the study was to gather information about infants. Observers
were told that at random intervals their observer agreement would be
checked; however, they received no advance warning of the checks. During
the four months after the initial training the observers were "spot
checked" five times for reliability on the same segment of the videotape
sequence. Spot checking is a commonly employed procedure whereby
reliability is assessed periodically rather than continuously (see Taplin
and Reid, 1973). In this case the interval between the five checks varied
from two to four weeks. The same segment of videotape was used for each
check, and all checks were taken independently for each observer. Between
sessions, observers actually scored the behavior of infants participating
in the infant research project. This procedure, with its long and
variable interval between checks and its interposed coding task, was
designed to minimize observer expectation and simple recall. In fact, the
observers verbally reported a vague sense of what was on the videotape,
but had a difficult time recalling any specifics.

Data analysis. Measures of (1) frequency, (2) duration-per-event,
and (3) duration were taken from the same presentation for a standard
trial length (5.5 minutes). A multiple regression analysis using
backward deletions (in which each main and interaction factor was
sequentially deleted from the total variance) was performed on each of the
three dependent measures, as suggested by Cohen (1968). The variables in
these regression analyses included Observers (2), Sessions (5), four
separate behavioral categories, and their interactions. In addition, a
trend analysis across sessions was included. Based on these regression
analyses, three analyses of variance were computed.
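
The paper does not say how the trend variables were coded. A minimal
sketch, assuming the five sessions are treated as equally spaced levels,
uses the standard orthogonal polynomial coefficients for five levels; the
session means below are made up for illustration.

```python
# Standard orthogonal polynomial coefficients for five equally spaced levels.
CONTRASTS = {
    "linear":    [-2, -1,  0,  1,  2],
    "quadratic": [ 2, -1, -2, -1,  2],
    "cubic":     [-1,  2,  0, -2,  1],
    "quartic":   [ 1, -4,  6, -4,  1],
}


def trend_contrasts(session_means):
    """Return the contrast value (sum of coefficient * mean) for each trend."""
    if len(session_means) != 5:
        raise ValueError("expected five session means")
    return {
        name: sum(c * m for c, m in zip(coefficients, session_means))
        for name, coefficients in CONTRASTS.items()
    }


# Hypothetical session means (not data from the study): a steady rise.
rising = [10.0, 12.0, 14.0, 16.0, 18.0]
print(trend_contrasts(rising))
```

For a steady rise only the linear contrast is nonzero. Intermixed rises and
declines across sessions would instead load on the higher-order contrasts.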

Results 

Frequency data. The analysis of variance for frequency scores is
presented in Table 1. The analysis revealed a significant linear trend
(p < .001) across sessions. Further variation across sessions was
characterized by a significant quartic trend (p < .025). Each of the
four behavioral categories (Vocalizations, Head Movements, Arm Movements,
and Body Movements) differed from the category of No Behavior against
which they were contrasted. Finally, the Observer X Vocalization
interaction was significant (p < .001), indicating variation between
observers in their recording of frequencies of Vocalization in contrast to
those of the No Behavior category. Observer B scored Vocalization more
frequently than Observer A. The regression analysis for frequency scores
revealed that a significant amount of variability (R² = .95) was accounted
for by the four behavioral categories tested against the category of No
Behavior.

Insert Table 1 about here.


Duration-per-event data. The analysis of variance for
duration-per-event scores is presented in Table 2. The findings for
duration-per-event scores were similar to those for the frequency scores.
Specifically, a significant linear trend (p < .001) was revealed,
indicating significant variation across sessions. Further variation across
sessions was characterized by a significant quadratic trend (p < .05).
Each of the four behavioral categories differed from the category of No
Behavior against which they were contrasted. Observers showed significant
overall differences (p < .001) in their duration-per-event scores across
all behavioral categories and all sessions. Observer A scored longer
durations-per-event than Observer B. In addition, the Observer X
Vocalization and the Observer X Body Movements interactions were
significant, indicating variation between observers in their recording of
the durations-per-event of these two behaviors in contrast to those of the
No Behavior category. As was the case for frequency scores, a significant
amount of variability (R² = .83) was accounted for by the four behavioral
categories tested against the category of No Behavior.

Insert Table 2 about here.
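
As a consistency check, the F ratios reported in Table 2 can be reproduced
by dividing each effect mean square by the residual mean square. The sketch
below assumes that this is how each F was formed (each effect has one
degree of freedom); the mean squares and reported F values are copied from
Table 2.

```python
MS_RESIDUAL = 5.56  # Residual (error) mean square from Table 2

# (mean square, reported F) for a selection of rows in Table 2.
REPORTED = {
    "Observer":        (121.68, 21.88),
    "Vocalizations":   (1335.28, 240.16),
    "Head Movement":   (963.33, 173.26),
    "Arm Movement":    (3168.27, 569.83),
    "Body Movement":   (2690.90, 483.97),
    "Linear Trend":    (156.25, 28.10),
    "Quadratic Trend": (30.18, 5.43),
}


def f_ratio(ms_effect, ms_residual=MS_RESIDUAL):
    """F ratio for a single-df effect: effect mean square over residual mean square."""
    return ms_effect / ms_residual


for source, (ms, reported_f) in REPORTED.items():
    print(f"{source:16s} computed F = {f_ratio(ms):7.2f}   reported F = {reported_f:7.2f}")
```

The computed ratios agree with the reported values to two decimal places.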

Duration data. The analysis of variance for duration scores is
presented in Table 3. In contrast to both the frequency and the
duration-per-event measures, the duration measure was stable across
sessions. Again, each of the four behavior categories differed from the
category of No Behavior against which they were contrasted. Although
observers showed no overall differences in their duration scores, Observer
X Behavior Category interactions were significant for each of the four
behaviors in contrast to the category of No Behavior. The regression
analysis for duration scores revealed that a significant amount of
variability (R² = .77) was accounted for by the four behavioral categories
tested against the category of No Behavior.

Insert Table 3 about here.

Discussion 

These results confirm previous findings of observer drift, i.e.,
the fluctuation of observations across sessions. Consistent with Reid's
(1970) finding of observer drift for modified-frequency scores, the
unconfounded real frequency score used in the present study showed large
fluctuations across sessions (see Figure 1). In addition, considerable
observer drift was noted for the duration-per-event measure. The pattern
of fluctuation of these scores was not characterized by a sharp decrement
followed by a stable level of performance, as found by Taplin and Reid
(1973). Rather, these scores showed intermixed rises and declines across
sessions, as indicated by significant quartic and quadratic trends for
frequency and duration-per-event measures, respectively. One possible
explanation for this additional fluctuation is that the amount of time
between sessions was both longer and more variable than has been
characteristic of previous studies of reliability.

In contrast to the findings for the dependent measures directly
related to frequency data (i.e., modified-frequency, real frequency, and
duration-per-event measures), real duration scores showed no observer
drift (see Figure 1). Duration scores were generally stable across
sessions. The greater stability of duration scores may be attributable to
several factors. It may be that across sessions observers tend to
discriminate an increased number of discrete events, each of shorter
duration. This is suggested in the present findings by an increase across
sessions in frequency scores and a concurrent decrease across sessions in
the duration-per-event scores. However, provided that event frequencies are
recorded in basically the same categories over sessions, total duration
scores for each category would be expected to remain relatively unaffected
by this trend toward finer discrimination of events.
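
The arithmetic behind this argument is simply that total duration equals
frequency times mean duration-per-event. A minimal sketch with made-up
event durations (not data from the study) shows how finer discrimination
changes the first two measures and leaves the third untouched.

```python
def summarize(event_durations):
    """Return (frequency, duration-per-event, total duration) for one category."""
    frequency = len(event_durations)
    total_duration = sum(event_durations)
    return frequency, total_duration / frequency, total_duration


# Hypothetical early session: the observer scores two longer events.
coarse = [12.0, 8.0]
# Hypothetical later session: the same behavior is split into four shorter events.
fine = [6.0, 6.0, 4.0, 4.0]

print("coarse:", summarize(coarse))
print("fine:  ", summarize(fine))
```

Frequency doubles and duration-per-event halves, while the total duration
for the category is the same in both records.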

A second factor contributing to greater stability of duration scores
is that duration, unlike frequency, is by definition a weighted measure.
Specifically, a duration score is more heavily weighted than a frequency
score to the extent that the durations of observable events are long. The
longer the durations-per-event are for a given behavior, the heavier is
the weighting of the duration score as compared to the frequency score.
Thus, minor fluctuations in scoring across sessions would be expected to
affect duration scores (with their greater weight) relatively less than
frequency scores. Figure 1 illustrates the finding that the percentage
difference of session means from the grand mean is relatively smaller for
duration scores than for frequency scores.
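
The paper does not give the formula behind this percentage difference. A
minimal sketch, assuming it is each session mean's deviation from the grand
mean expressed as a percentage of the grand mean, is shown below with
made-up session means.

```python
def percent_difference_from_grand_mean(session_means):
    """Deviation of each session mean from the grand mean, as a percentage."""
    grand_mean = sum(session_means) / len(session_means)
    return [100.0 * (mean - grand_mean) / grand_mean for mean in session_means]


# Hypothetical session means (not data from the study).
frequency_means = [8.0, 9.0, 10.0, 11.0, 12.0]
duration_means = [98.0, 101.0, 100.0, 99.0, 102.0]

print("frequency:", percent_difference_from_grand_mean(frequency_means))
print("duration: ", percent_difference_from_grand_mean(duration_means))
```

Expressing both measures relative to their own grand means puts them on a
common scale, so their fluctuation across sessions can be compared directly.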

Insert Figure 1 about here. 

An interesting secondary finding suggests alternative means of
assessing between-observer reliability. It was found that the majority
of the variance was accounted for among the Behavioral Categories being
coded, rather than between Observers. For frequency, duration-per-event,
and duration scores, respectively, the proportion of the variance accounted
for among behavioral categories was .97, .83, and .77, whereas the
proportion of the variance accounted for between Observers was .02, J.1,
and .002. This implies that the between-observer reliability was high for
all three measures. Traditional measures of reliability support this notion
insofar as both the average percentage agreement and the average
correlation coefficient between observers were greater than .90 for
frequency scores obtained in the first training session. Nevertheless, the
analyses of variance revealed that observers differed significantly in
their recording of at least one behavior for each of the dependent
measures. Furthermore, observers showed overall differences across all four
behaviors in their duration-per-event scores. These findings indicate that
the analyses of variance provide a more sensitive test of differences
between observers than do the traditional measures of between-observer
reliability.

One possible point of criticism specific to this analysis is that
total duration summed across all coding categories is completely determined
by the standard trial length. Since total duration cannot vary from
session to session, one might conclude that the reported stability of
duration across sessions is trivial. However, the reported stability was
based on four behavior codes which together accounted for an average of
only 40 per cent of the total duration; the category of No Behavior
accounted for the other 60 per cent. Since the duration scores for the
four behavior categories were thus free to vary, the stability of duration
of these individual behaviors across sessions was a meaningful finding.

Taken together, these findings suggest that to properly interpret
measures of reliability the robustness of "real time" duration measures
must be considered. Based on the findings of the present study, duration
measures appear to be less susceptible to observer drift. In addition,
assessment of reliability through analyses of variance and regression
analyses should be further explored, considering the potential advantages
in precision and sensitivity.

Table 1

Analysis of Variance for Frequency

Source                    df            MS              F

Total                     12       1420.51
Observer (O)               1          7.22    [illegible]
Vocalizations (V)          1   [illegible]    [illegible]
Head Movement (H)          1   [illegible]    [illegible]
Arm Movement (A)           1   [illegible]    [illegible]
Body Movement (B)          1       6993.80     1387.60***
Trends
  Linear Trend (T)         1        153.76       30.51**
  Quartic Trend            1         28.28        5.61**
  Cubic Trend (a)
  Quadratic Trend (a)
O X T                      1         12.96          ns
O X V                      1         36.98        7.34***
O X H                      1         20.83        4.13*
O X A                      1          6.67          ns
O X B                      1         20.00        3.97*

(a) The quadratic and cubic trends were eliminated from
the analysis by the computer program due to the
small MS attributable to these factors.

  * p < .05
 ** p < .025
*** p < .001

Table 2

Analysis of Variance for Duration-per-Event

Source                    df            MS              F

Total                     14        611.41
Residual (error)          35          5.56
Observer (O)               1        121.68       21.88***
Vocalizations (V)          1       1335.28      240.16***
Head Movement (H)          1        963.33      173.26***
Arm Movement (A)           1       3168.27      569.83***
Body Movement (B)          1       2690.90      483.97***
Trends
  Linear Trend (T)         1        156.25       28.10***
  Quadratic Trend          1         30.18        5.43*
  Cubic Trend              1          1.00          ns
  Quartic Trend            1         16.05          ns
O X T                      1         13.69          ns
O X V                      1         30.42        5.47*
O X H                      1           .83          ns
O X A                      1          1.67          ns
O X B                      1         33.80        6.08**

  * p < .05
 ** p < .025
*** p < .001

Table 3

Analysis of Variance for Duration

Source                    df              MS              F

Total                     14     [illegible]
Residual (error)          35     [illegible]
Observer (O)               1     [illegible]          ns
Vocalizations (V)          1     [illegible]    [illegible]
Head Movement (H)          1      1061824.50    [illegible]
Arm Movement (A)           1      5759801.70     2389.26***
Body Movement (B)          1     17611891.00     7333.03***
Trends
  Linear (T)               1           27.00          ns
  Quadratic                1           96.00          ns
  Cubic                    1            7.00          ns
  Quartic                  1            4.00          ns
O X T                      1           27.00          ns
O X V                      1        12200.00        5.06*
O X H                      1        49045.00       20.34***
O X A                      1        23602.00        9.79**
O X B                      1        61829.00       25.65***

  * p < .05
 ** p < .01
*** p < .001



Figure Caption 

Percentage difference of session means (based on the
four infant behaviors) from the grand mean for frequency
and duration scores.